11 research outputs found

    A Study of Energy and Locality Effects using Space-filling Curves

    The cost of energy is becoming an increasingly important driver for the operating cost of HPC systems, adding yet another facet to the challenge of producing efficient code. In this paper, we investigate the energy implications of trading computation for locality using Hilbert and Morton space-filling curves with dense matrix-matrix multiplication. The advantage of these curves is that they exhibit an inherent tiling effect without requiring specific architecture tuning. By accessing the matrices in the order determined by the space-filling curves, we can trade computation for locality. The index computation overhead of the Morton curve is found to be balanced against its locality and energy efficiency, while the overhead of the Hilbert curve outweighs its improvements on our test system. Comment: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)
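To make the "index computation overhead" mentioned in the abstract concrete, the following is a minimal sketch of Morton (Z-order) indexing by bit interleaving, and of a dense matrix multiplication that visits the output elements in Morton order. It is not the paper's implementation: the function names, the row-major list-of-lists storage, and the restriction to square matrices whose size is a power of two are all assumptions made for illustration.

```python
def morton_encode(row: int, col: int, bits: int = 16) -> int:
    """Interleave the bits of row and col into one Morton (Z-order) index.

    Column bits land in even positions, row bits in odd positions.
    """
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index


def morton_decode(index: int, bits: int = 16) -> tuple[int, int]:
    """Invert morton_encode: split an index back into (row, col)."""
    row = col = 0
    for b in range(bits):
        col |= ((index >> (2 * b)) & 1) << b
        row |= ((index >> (2 * b + 1)) & 1) << b
    return row, col


def matmul_morton_order(a, b):
    """Multiply two n x n matrices, visiting output cells along the Morton curve.

    Assumes n is a power of two, so every index in range(n * n) decodes to a
    valid (row, col) pair. The decode call per output cell is the extra
    computation that is traded for the curve's tiling-like access pattern.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for z in range(n * n):
        i, j = morton_decode(z)
        c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))
    return c
```

For example, `matmul_morton_order([[1, 2], [3, 4]], [[5, 6], [7, 8]])` yields the same product as a conventional triple loop; only the order in which the output cells are computed differs.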

    RVSDG: An Intermediate Representation for Optimizing Compilers

    Intermediate Representations (IRs) are central to optimizing compilers as the way the program is represented may enhance or limit analyses and transformations. Suitable IRs focus on exposing the most relevant information and establish invariants that different compiler passes can rely on. While control-flow centric IRs appear to be a natural fit for imperative programming languages, analyses required by compilers have increasingly shifted to understand data dependencies and work at multiple abstraction layers at the same time. This is partially evidenced in recent developments such as the MLIR proposed by Google. However, rigorous use of data flow centric IRs in general purpose compilers has not been evaluated for feasibility and usability as previous works provide no practical implementations. We present the Regionalized Value State Dependence Graph (RVSDG) IR for optimizing compilers. The RVSDG is a data flow centric IR where nodes represent computations, edges represent computational dependencies, and regions capture the hierarchical structure of programs. It represents programs in demand-dependence form, implicitly supports structured control flow, and models entire programs within a single IR. We provide a complete specification of the RVSDG, construction and destruction methods, as well as exemplify its utility by presenting Dead Node and Common Node Elimination optimizations. We implemented a prototype compiler and evaluate it in terms of performance, code size, compilation time, and representational overhead. Our results indicate that the RVSDG can serve as a competitive IR in optimizing compilers while reducing complexity.
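The idea behind dead node elimination in a demand-dependence graph can be illustrated with a generic mark-and-sweep pass. This is only a sketch on a toy, flat dataflow graph: it does not model RVSDG regions or structural nodes, and the dictionary representation and the name `dead_node_elimination` are assumptions for illustration, not the paper's specification.

```python
def dead_node_elimination(nodes, results):
    """Remove nodes that no program result depends on.

    nodes:   dict mapping a node name to the list of node names it consumes
             (a hypothetical encoding of dependence edges).
    results: iterable of node names whose values the program demands.
    """
    live = set()
    worklist = list(results)
    # Mark phase: walk dependence edges backwards from the demanded results.
    while worklist:
        name = worklist.pop()
        if name in live:
            continue
        live.add(name)
        worklist.extend(nodes[name])
    # Sweep phase: keep only the nodes that were reached.
    return {name: operands for name, operands in nodes.items() if name in live}
```

With `nodes = {"a": [], "b": [], "sum": ["a", "b"], "unused": ["a"]}` and `results = ["sum"]`, the node `unused` is removed because nothing demanded by the program depends on it.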

    Soil macrofauna communities in Brazilian land-use systems

    Soil animal communities include more than 40 higher-order taxa, representing over 23% of all described species. These animals have a wide range of feeding sources and contribute to several important soil functions and ecosystem services. Although many studies have assessed macroinvertebrate communities in Brazil, few of them have been published in journals and even fewer have made the data openly available for consultation and further use. As part of ongoing efforts to synthesise the global soil macrofauna communities and to increase the amount of openly-accessible data in GBIF and other repositories related to soil biodiversity, the present paper provides links to 29 soil macroinvertebrate datasets covering 42 soil fauna taxa, collected in various land-use systems in Brazil. A total of 83,085 georeferenced occurrences of these taxa are presented, based on quantitative estimates performed using a standardised sampling method commonly adopted worldwide to collect soil macrofauna populations, i.e. the TSBF (Tropical Soil Biology and Fertility Programme) protocol. This consists of digging soil monoliths of 25 x 25 cm area, with handsorting of the macroinvertebrates visible to the naked eye from the surface litter and from within the soil, typically in the upper 0-20 cm layer (but sometimes shallower, i.e. top 0-10 cm or deeper to 0-40 cm, depending on the site). The land-use systems included anthropogenic sites managed with agricultural systems (e.g. pastures, annual and perennial crops, agroforestry), as well as planted forests and native vegetation located mostly in the southern Brazilian State of Paraná (96 sites), with a few additional sites in the neighbouring states of São Paulo (21 sites) and Santa Catarina (five sites). 
Important metadata on soil properties, particularly soil chemical parameters (mainly pH, C, P, Ca, K, Mg, Al contents, exchangeable acidity, Cation Exchange Capacity, Base Saturation and, infrequently, total N), particle size distribution (mainly % sand, silt and clay) and, infrequently, soil moisture and bulk density, as well as on human management practices (land use and vegetation cover) are provided. These data will be particularly useful for those interested in estimating land-use change impacts on soil biodiversity and its implications for below-ground foodwebs, ecosystem functioning and ecosystem service delivery. Quantitative estimates are provided for 42 soil animal taxa, for two biodiversity hotspots: the Brazilian Atlantic Forest and Cerrado biomes. Data are provided at the individual monolith level, representing sampling events ranging from February 2001 up to September 2016 in 122 sampling sites and over 1800 samples, for a total of 83,085 occurrences.
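Because the data are reported per monolith, a common first step is converting a raw count into an areal density. The sketch below assumes the 25 x 25 cm monolith area stated in the abstract; the function name and the choice of individuals per square metre as the target unit are illustrative assumptions, not part of the datasets.

```python
MONOLITH_SIDE_M = 0.25  # 25 cm monolith side, as described in the abstract


def density_per_m2(count: float, side_m: float = MONOLITH_SIDE_M) -> float:
    """Convert a per-monolith count to individuals per square metre."""
    area_m2 = side_m * side_m  # 0.0625 m^2 for a 25 x 25 cm monolith
    return count / area_m2
```

A 25 x 25 cm monolith covers 1/16 of a square metre, so a count of 12 individuals corresponds to `density_per_m2(12)`, i.e. 192 individuals per square metre.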

    Principles, Techniques, and Tools for Explicit and Automatic Parallelization

    No full text
    The end of Dennard scaling also brought an end to frequency scaling as a means to improve performance. Chip manufacturers had to abandon frequency and superscalar scaling as processors became increasingly power constrained. An architecture’s power budget became the limiting factor to performance gains, and computations had to be performed more energy-efficiently. Designers turned to chip multiprocessors (CMPs) and developers began to employ specialized architectures, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), to further improve performance while meeting the power envelope. The exploitation of parallelism in an energy-efficient manner became the primary way forward. Until the end of Dennard scaling, programs experienced transparent performance gains with every new processor generation. However, CMPs, GPUs, and FPGAs rely on the static extraction of parallelism to improve performance, and programs need to be modified to take advantage of these architectures. Thus, performance gains are no longer achieved transparently, and developers and tools are forced to face new, as well as long-neglected challenges in program parallelization. These challenges include the detection and encoding of potential parallelism in automatic approaches, application portability issues on GPUs, and performance portability issues on CMPs. It is essential to address these challenges, as the continuous increase in computer performance now solely relies on the exploitation of parallelism. This thesis consists of three parts, each addressing one of the aforementioned challenges in program parallelization. The first part addresses the detection and encoding of potential parallelism in automatic approaches. It presents the Regionalized Value State Dependence Graph (RVSDG) as an alternative intermediate representation for optimizing and parallelizing compilers. 
The RVSDG exposes the hierarchical structure of programs and explicitly models the dependencies between computations, permitting the explicit encoding of concurrent operations and program structures, such as conditionals, loops, and functions. This helps to expose the inherent parallelism in programs and its structures by employing well-known methods for the extraction of instruction level parallelism. The second part addresses application portability issues on GPUs. A GPU’s specialized architecture is optimized for highly regular data-parallel applications, but compromises program performance for workloads with irregular control flow, potentially leading to redundant code execution. We propose a control flow restructuring method to effectively eliminate repeated code execution on GPUs and potentially improve performance. The third part addresses performance portability on CMPs. This issue arises as developers overfit their application to a specific architecture, which results in suboptimal performance for different program inputs or different architectures. We improve performance analysis for OpenMP programs by addressing the scalability challenges of the grain graph visualization method. We present an aggregation method for grain graphs that hierarchically groups related nodes into a single node. This aggregated graph can then be navigated by progressively uncovering nodes with performance issues, while hiding unrelated regions of the graph. This enhances productivity by enabling developers to understand performance problems of highly-parallel OpenMP programs more easily. The insights and techniques developed by addressing these three challenges may result in improved methods and tools for the exploitation of parallelism. The RVSDG is a promising IR for parallelizing compilers, as it permits the encoding of concurrent computations. The grain graph offers a familiar structural view to developers along with the performance issues of a particular program. 
In the future, it is necessary to cast these ideas into mature tools to make them applicable in practice and foster further research.

    Diagnosing Highly-Parallel OpenMP Programs with Aggregated Grain Graphs

    No full text
    Grain graphs simplify OpenMP performance analysis by visualizing performance problems from a fork-join perspective that is familiar to programmers. However, when programmers decide to expose a high amount of parallelism by creating thousands of task and parallel for-loop chunk instances, the resulting grain graph becomes large and tedious to understand. We present an aggregation method that hierarchically groups related nodes together to reduce grain graphs of any size to one single node. This aggregated graph is then navigated by progressively uncovering groups and following visual clues that guide programmers towards problems while hiding non-problematic regions. Our approach enhances productivity by enabling programmers to understand problems in highly-parallel OpenMP programs with less effort than before.

    RVSDG: An intermediate representation for optimizing compilers

    No full text
    Intermediate Representations (IRs) are central to optimizing compilers as the way the program is represented may enhance or limit analyses and transformations. Suitable IRs focus on exposing the most relevant information and establish invariants that different compiler passes can rely on. While control-flow centric IRs appear to be a natural fit for imperative programming languages, analyses required by compilers have increasingly shifted to understand data dependencies and work at multiple abstraction layers at the same time. This is partially evidenced in recent developments such as the Multi-Level Intermediate Representation (MLIR) proposed by Google. However, rigorous use of data flow centric IRs in general purpose compilers has not been evaluated for feasibility and usability as previous works provide no practical implementations.

    Towards Fine-Grained Dynamic Tuning of HPC Applications on Modern Multi-Core Architectures

    No full text
    There is a consensus that exascale systems should operate within a power envelope of 20 MW. Consequently, energy conservation is still considered the most crucial constraint if such systems are to be realized. So far, most research on this topic has focused on strategies such as power capping and dynamic power management. Although these approaches can reduce power consumption, we believe that they might not be sufficient to reach the exascale energy-efficiency goals. Hence, we aim to adopt techniques from embedded systems, where energy efficiency has always been the fundamental objective. A successful energy-saving technique used in embedded systems is to integrate fine-grained autotuning with dynamic voltage and frequency scaling. In this paper, we apply a similar technique to a real-world HPC application. Our experimental results on an HPC cluster indicate that such an approach can save up to 19% of energy compared to the baseline configuration, with negligible performance loss.